Estimating the Probability of Approximate Matches
Abstract
While considerable effort has been made, and some progress achieved, toward an analytic formula for the probability of an approximate match, that work has not yet come to fruition [4, 6, 2, 1]. We therefore develop an unbiased estimation procedure for this probability, given a specific string P ∈ Σ* and a specific cost function δ for weighting edit operations. Problems of this type are of general interest; see, for example, a recent paper [5] giving an unbiased estimator for counting the words of a fixed length in a regular language. We were further motivated by a particular application arising in Anrep, the pattern matching system we designed for genomic sequence analysis [8, 11]. Anrep searches for a complex pattern by backtracking over subprocedures that find approximate matches. The subpatterns are searched in an order that attempts to minimize the expected running time of the search. Determining this optimal backtrack order requires a reasonably accurate estimate of the probability of finding an approximate match to each subpattern. Given that the probabilities involved are frequently 10 or less, the simple expedient of measuring match frequency over a random text of several thousand characters has been less than satisfactory. The unbiased estimator herein is shown to give good results within about a thousand samples, even for low-probability patterns. Thus it is expected to improve the performance of Anrep and may have utility in estimating the significance of similarity searches. Proceeding formally, suppose we are given
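To make the baseline concrete, the following is a minimal sketch of the "simple expedient" the abstract calls unsatisfactory: counting how often an approximate match ends at a position of one random text. It is not the unbiased estimator of the paper. It assumes a unit-cost edit distance in place of the general cost function δ, a uniformly random text, and a threshold `k` on the match cost; the function names `match_end_positions` and `naive_match_frequency` are hypothetical.

```python
import random


def match_end_positions(pattern, text, k):
    """Flag each text position where some substring ending there lies
    within unit-cost edit distance k of the pattern.

    Assumption: unit costs stand in for the general cost function delta.
    Uses the standard column-wise dynamic program with a free start,
    so a match may begin at any text position.
    """
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    hits = []
    for ch in text:
        prev_diag = col[0]  # value at (i-1, j-1)
        col[0] = 0          # a match may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]    # value at (i, j-1)
            col[i] = min(
                prev_diag + (pattern[i - 1] != ch),  # substitute or match
                old + 1,                             # extra text character
                col[i - 1] + 1,                      # skipped pattern character
            )
            prev_diag = old
        hits.append(col[m] <= k)
    return hits


def naive_match_frequency(pattern, k, alphabet, text_len, seed=0):
    """Fraction of positions in one random text where a match ends.

    Assumption: text characters are drawn uniformly from the alphabet.
    """
    rng = random.Random(seed)
    text = [rng.choice(alphabet) for _ in range(text_len)]
    hits = match_end_positions(pattern, text, k)
    return sum(hits) / text_len


if __name__ == "__main__":
    # A text of several thousand characters, as in the abstract.
    print(naive_match_frequency("acgtacgtacgt", 2, "acgt", 5000))
```

When the true probability p is far smaller than one over the text length, a text of several thousand characters will usually contain no match at all, so this frequency is typically 0 and carries little information. That is the limitation the abstract points to.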